# Visual Instruction Fine-tuning

Mistral Small 3.1 24B Instruct 2503 GGUF
Apache-2.0
A vision-enhanced version of Mistral-Small-3.1-24B-Instruct-2503 that supports image-to-text generation tasks.
Image-to-Text
ggml-org
670
3
General Reasoner 14B Preview
Apache-2.0
A multimodal reasoning model trained on the Qwen2.5-14B base model and the VisualWebInstruct-Verified dataset, supporting English-language tasks.
Large Language Model Transformers English
TIGER-Lab
33
3
Qwen2.5 VL 32B Instruct GGUF
Apache-2.0
Qwen2.5-VL-32B-Instruct is a multimodal vision-language model that supports joint understanding and generation tasks for both images and text.
Image-to-Text English
samgreen
25.59k
6
Llama 3.2 Vision Instruct Bpmncoder
Apache-2.0
A Llama 3.2 11B vision instruction fine-tuned model trained with Unsloth and 4-bit quantization, with a reported 2x faster training speed.
Image-to-Text Transformers English
utkarshkingh
40
1
Qwen2.5 VL 72B Instruct GGUF
Other
Qwen2.5-VL-72B-Instruct is a multimodal vision-language model that supports interactive generation tasks involving images and text.
Image-to-Text English
samgreen
2,073
1
Llama 3.2 11B Vision Medical
Apache-2.0
A model fine-tuned from unsloth/Llama-3.2-11B-Vision-Instruct using Unsloth and Hugging Face's TRL library, with a reported 2x training speedup.
Image-to-Text Transformers English
Varu96
25
1
Llama 3.2 11B Vision Invoices Mini
Apache-2.0
A multimodal large language model fine-tuned from unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit for visual instruction understanding tasks; Unsloth optimization is reported to double training speed.
Image-to-Text Transformers English
atulSethi
46
1
Llama 3.2 11B Vision Radiology Mini
Apache-2.0
A vision instruction fine-tuned model trained with Unsloth, supporting multimodal tasks.
Image-to-Text Transformers English
mervinpraison
39
2
Vsft Llava 1.5 7b Hf Trl
A multimodal vision-language model based on LLaVA-1.5-7B, trained with Visual Supervised Fine-Tuning (VSFT) for image understanding and dialogue generation.
Image-to-Text Transformers English
HuggingFaceH4
65
14
Llava V1.5 Mlp2x 336px Pretrain Vicuna 13b V1.5
LLaVA is an open-source multimodal chatbot built by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data.
Image-to-Text Transformers
liuhaotian
66
2
Llava Pretrain Vicuna 7b V1.3
LLaVA is an open-source multimodal chatbot built by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data.
Image-to-Text Transformers
liuhaotian
54
1
Chinese LLaVA Cllama2
OpenRAIL
An open-source bilingual (Chinese-English) vision-language assistant, available for commercial use, that supports multimodal dialogue in both languages.
Image-to-Text Transformers Multilingual
LinkSoul
51
19
Instructblip Flan T5 Xl
MIT
InstructBLIP is the visual instruction-tuned version of BLIP-2, capable of vision-language tasks such as image captioning and visual question answering.
Image-to-Text Transformers English
Salesforce
16.89k
29